Problem Assessment
Our assessment is based on geolocation information for a set of 1,752 wireless subscribers of a phone company, recorded across various cells in Nova Scotia. The historical data was provided as of September 1st, 2020. We have also been provided a dataset describing the Poseidon CTC Mall and its geofence cell list.
Using the data science toolbox, we need to understand how much time, on average, subscribers spend at the Poseidon CTC Mall. The goal of the project is to produce an executive summary and business insights that will help the marketing team achieve this objective.
Business Problem
Recently, there have been several discussions within our marketing team about developing a strategy to approach our clients more efficiently, according to the time they spend at the Poseidon CTC Mall. As a team, we need to determine the best time frame in which to target those clients by using average dwell time, a metric that captures the amount of time a device spends on a particular channel.
Our current point of interest (POI) for developing this strategy is the Poseidon CTC Mall.
Business Value
By developing a strategy to target our clients efficiently, we create organic growth for the business through increased sales revenue. By putting our final data product into production, we will get to know our clients better and improve our Voice of Customer.
Tools used
The tools used for this project are:
Python Programming: Python provides a vast set of utilities for data science and machine learning through its most prominent packages: Pandas, Scikit-Learn, NumPy and others.
R - Reticulate: Reticulate is a tool for running both Python and R in the same environment.
R Markdown: R (through RStudio) provides a set of tools to publish our work as documents and HTML sources, giving better presentations for our executives or audience.
ASSESSMENT PART 1
1. Loading packages
First we load the reticulate package so we can write Python and R code in our Markdown environment:
# Use only once:
#Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3.7")
#library(reticulate)
#use_python('/usr/bin/python3', require = T)
Second, we import Python packages
import pandas as pd # package for data wrangling and transformation
import numpy as np # package for linear algebra
import seaborn as sns # advanced graphical interface package
import matplotlib.pyplot as plt # basic graphical interface package
import copy
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import KMeans
2. Collecting & Preparing Data
The first step is to load the data, which has been stored in CSV format. In addition to the files already provided, I found supporting information that helps our analysis at https://mcc-mnc-list.com/list.
With the cgi dataset I was able to get information about:
- MCC: Mobile Country Code
- MNC: Mobile Network Code
- LAC: Location Area Code
- CI: Cell Identity
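To make the structure concrete, a single cgi_id value (format assumed from the dataset: an optional radio-technology prefix, then MCC-MNC-LAC-CI separated by hyphens) can be decomposed like this:

```python
# Decompose a sample cgi_id (assumed format: [RAT:]MCC-MNC-LAC-CI)
sample = "4G:302-220-113919-131"

# Split off the optional radio-technology prefix (e.g. "4G:")
rat, _, rest = sample.partition(":")
if not rest:                      # no prefix present
    rat, rest = "Other", rat

mcc, mnc, lac, ci = rest.split("-")
print(rat, mcc, mnc, lac, ci)     # 4G 302 220 113919 131
```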
# geo location data
geo = pd.read_csv('/home/analytics/R/Projects/Python/datasets/telco/geolocation_data.csv')
# Poseidon CTC Mall Geofence data
poseidon = pd.read_csv('/home/analytics/R/Projects/Python/datasets/telco/Poseidon CTC Mall Geofence Cell List.csv')
# cgi data
cgi = pd.read_csv('/home/analytics/R/Projects/Python/datasets/telco/cgi.csv')
Checking shape and data structure
geolocation data
print(geo.shape)
## (408917, 3)
geo.head()
## imsi_id ... cgi_id
## 0 778fb33d64b6dcbc67c79ca6d35f7820f993f26c74502c... ... 4G:302-220-113919-131
## 1 753c0c75ac582307c3b0d04bbe4cfe7d118684d1b4060e... ... 4G:302-220-113155-41
## 2 05f8982252e683f1fed1f3ebd42f60848587a4e270ddcc... ... 4G:302-220-113937-10
## 3 d5ce6f7f332850a23ae4700231c87dec634a1e7832304d... ... 302-220-11201-23458
## 4 05f8982252e683f1fed1f3ebd42f60848587a4e270ddcc... ... 4G:302-220-113937-10
##
## [5 rows x 3 columns]
poseidon data
print(poseidon.shape)
## (105, 1)
poseidon.head()
## Poseidon CTC Mall Geofence Cell List
## 0 4G:302-220-113122-132
## 1 4G:302-220-113122-42
## 2 4G:302-220-113122-112
## 3 4G:302-220-113122-2
## 4 4G:302-220-113122-131
poseidon['Poseidon CTC Mall Geofence Cell List'].nunique()
## 75
We have 75 unique cells (fencing the “Poseidon CTC Mall” POI) that we are interested in. So we drop the duplicate values and have the data ready to be merged:
poseidon.drop_duplicates(subset=['Poseidon CTC Mall Geofence Cell List'], inplace = True)
poseidon.shape # indeed we have 75 cells
## (75, 1)
poseidon.head()
## Poseidon CTC Mall Geofence Cell List
## 0 4G:302-220-113122-132
## 1 4G:302-220-113122-42
## 2 4G:302-220-113122-112
## 3 4G:302-220-113122-2
## 4 4G:302-220-113122-131
CGI (Cell Global Identity)
cgi.shape
## (3, 6)
cgi.head()
## MCC ... Operator
## 0 302 ... Telus Mobility
## 1 302 ... Bell Mobility
## 2 302 ... Shared Telus, Bell, and SaskTel
##
## [3 rows x 6 columns]
Converting data types:
cgi.MCC = cgi.MCC.astype('object')
cgi.MNC = cgi.MNC.astype('object')
3. Merging the data
df = pd.merge(poseidon, geo, left_on = 'Poseidon CTC Mall Geofence Cell List', right_on = 'cgi_id')
df.head()
## Poseidon CTC Mall Geofence Cell List ... cgi_id
## 0 4G:302-220-113122-132 ... 4G:302-220-113122-132
## 1 4G:302-220-113122-132 ... 4G:302-220-113122-132
## 2 4G:302-220-113122-132 ... 4G:302-220-113122-132
## 3 4G:302-220-113122-132 ... 4G:302-220-113122-132
## 4 4G:302-220-113122-42 ... 4G:302-220-113122-42
##
## [5 rows x 4 columns]
df.shape
## (432, 4)
Number of unique clients to be analyzed: 33
df.imsi_id.nunique()
## 33
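As a side note, a left merge with pandas' `indicator` flag can reveal geofence cells that matched no events at all, something the inner join above silently drops. A minimal sketch on toy stand-in frames (the cell values here are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the poseidon and geo frames (hypothetical cell values)
poseidon_demo = pd.DataFrame({
    "Poseidon CTC Mall Geofence Cell List": ["4G:302-220-113122-132",
                                             "4G:302-220-113122-999"]})
geo_demo = pd.DataFrame({"imsi_id": ["abc"],
                         "cgi_id": ["4G:302-220-113122-132"]})

check = pd.merge(poseidon_demo, geo_demo, how="left",
                 left_on="Poseidon CTC Mall Geofence Cell List",
                 right_on="cgi_id",
                 indicator=True)   # adds a _merge column: 'both' / 'left_only'

# Geofence cells with no recorded events
unmatched = check.loc[check["_merge"] == "left_only",
                      "Poseidon CTC Mall Geofence Cell List"]
print(unmatched.tolist())          # ['4G:302-220-113122-999']
```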
4. Data verification / cleaning
telco_nulls = df.isna().sum()[df.isna().sum()!= 0].reset_index().rename(columns=
{'index': 'col_name', 0: '#'})
telco_nulls['%'] = telco_nulls['#'] / df.shape[0]
telco_nulls
## Empty DataFrame
## Columns: [col_name, #, %]
## Index: []
No NULL values were found in the data.
5. Data Transformation & Feature Engineering
Extract Information from CGI: MCC - MNC - LAC - CI
new = df["cgi_id"].str.split("-", n = 4, expand = True)
broad_MCC = pd.DataFrame(new[0])
# If broadband does not contain 4G then place "other"
broad_MCC[1] = broad_MCC[0].map(lambda x: "4G" if "4G" in x else 'Other')
Because the MCC is always a three-digit code, we extract the last three characters:
broad_MCC[2] = broad_MCC[0].str[-3:]
broad_MCC[2].nunique() #indeed we have only 1 value.
## 1
broad_MCC.head()
## 0 1 2
## 0 4G:302 4G 302
## 1 4G:302 4G 302
## 2 4G:302 4G 302
## 3 4G:302 4G 302
## 4 4G:302 4G 302
df['broadband'] = broad_MCC[1]
df['MCC'] = broad_MCC[2]
df['MNC'] = new[1]
df['LAC'] = new[2]
df['CI'] = new[3]
df.head()
## Poseidon CTC Mall Geofence Cell List ... CI
## 0 4G:302-220-113122-132 ... 132
## 1 4G:302-220-113122-132 ... 132
## 2 4G:302-220-113122-132 ... 132
## 3 4G:302-220-113122-132 ... 132
## 4 4G:302-220-113122-42 ... 42
##
## [5 rows x 9 columns]
Dropping unnecessary columns
df.drop(columns=['Poseidon CTC Mall Geofence Cell List', 'cgi_id'], inplace = True)
df.head()
## imsi_id ... CI
## 0 b58217bd1fe520ae4cb6a2228ea1cffa68590965cc5ba8... ... 132
## 1 2f65da3d1f0ed00d831a2037868732a262523d6231e9bf... ... 132
## 2 450c1b07f639674e3e004251fb86ce273ac21bc1c1754e... ... 132
## 3 2f627a032a70455a80760c51c0c8cc9282c87f52a25703... ... 132
## 4 044aa99b8a013c64e3924cf76dec7c9a25d7315bcaccaf... ... 42
##
## [5 rows x 7 columns]
Converting data types so the frames can be merged:
df.MCC = df.MCC.astype('str')
df.MNC = df.MNC.astype('str')
cgi.MCC = cgi.MCC.astype('str')
cgi.MNC = cgi.MNC.astype('str')
telco = pd.merge(df, cgi, how = 'left', left_on = ['MCC', 'MNC'], right_on = ['MCC', 'MNC' ])
telco.head()
## imsi_id ... Operator
## 0 b58217bd1fe520ae4cb6a2228ea1cffa68590965cc5ba8... ... Telus Mobility
## 1 2f65da3d1f0ed00d831a2037868732a262523d6231e9bf... ... Telus Mobility
## 2 450c1b07f639674e3e004251fb86ce273ac21bc1c1754e... ... Telus Mobility
## 3 2f627a032a70455a80760c51c0c8cc9282c87f52a25703... ... Telus Mobility
## 4 044aa99b8a013c64e3924cf76dec7c9a25d7315bcaccaf... ... Telus Mobility
##
## [5 rows x 11 columns]
telco.shape
## (432, 11)
As we can see, we have merged the data correctly. Some of the columns shown previously only illustrate the type of information we have at hand. For instance, Country and Brand: Country shows the country we are analyzing, and Brand overlaps with Operator. ISO always contains “CA”.
From this point on, we drop these three columns:
telco.drop(columns=['ISO', 'Country', 'Brand'], inplace = True)
telco.head()
## imsi_id ... Operator
## 0 b58217bd1fe520ae4cb6a2228ea1cffa68590965cc5ba8... ... Telus Mobility
## 1 2f65da3d1f0ed00d831a2037868732a262523d6231e9bf... ... Telus Mobility
## 2 450c1b07f639674e3e004251fb86ce273ac21bc1c1754e... ... Telus Mobility
## 3 2f627a032a70455a80760c51c0c8cc9282c87f52a25703... ... Telus Mobility
## 4 044aa99b8a013c64e3924cf76dec7c9a25d7315bcaccaf... ... Telus Mobility
##
## [5 rows x 8 columns]
6. Exploratory Data Analysis
Our data exploration consists of analyzing our categorical and numerical variables. Next, we will focus on the time-based analysis.
6.1. Client Analysis
telco.imsi_id.nunique()
## 33
telco.shape[0]
## 432
Insight 1: On September 1st, 2020, we had a total of 432 visits covered by our current POI (the Poseidon CTC Mall, Canada). In total, the POI served 33 unique clients.
6.2. Broadband Analysis
sns.countplot(telco.broadband)
round(100 * telco.groupby(['broadband']).size() / telco.shape[0])
## broadband
## 4G 34.0
## Other 66.0
## dtype: float64
Insight 2: 34% of the traffic at our POI used 4G broadband and 66% used other technologies. This 66% was catalogued as “Other” since the dataset does not specify which broadband technology was used at these particular points.
6.3. MCC / MNC / LAC / CI Analysis
telco.groupby(['MCC']).size()
## MCC
## 302 432
## dtype: int64
telco.groupby(['MNC']).size()
## MNC
## 220 431
## 880 1
## dtype: int64
sns.countplot(telco.LAC)
plt.show()
LAC = pd.DataFrame()
LAC['n'] = telco.groupby(['LAC']).size()
LAC['%'] = round(100 * (LAC['n'] / telco.shape[0]))
LAC
## n %
## LAC
## 11204 285 66.0
## 113122 104 24.0
## 123353 43 10.0
CI = pd.DataFrame()
CI['n'] = telco.groupby(['CI']).size().sort_values(ascending = False)
CI['%'] = round(100 * (CI['n'] / telco.shape[0]))
CI
## n %
## CI
## 24122 111 26.0
## 24353 79 18.0
## 21122 50 12.0
## 41 48 11.0
## 25122 43 10.0
## 111 37 9.0
## 131 23 5.0
## 1 18 4.0
## 42 5 1.0
## 132 4 1.0
## 2 4 1.0
## 112 3 1.0
## 71 3 1.0
## 22122 2 0.0
## 11 1 0.0
## 22 1 0.0
Insight 4: In addition to the previous insight, our group of cell towers is divided into three LACs: 11204, 113122 and 123353. LAC 11204 saw the highest demand (285 events, about 66%); this is the most critical group so far. Our POI registered a total of 16 cells serving the Poseidon CTC Mall, and the most demanded CIs (Cell IDs) were 24122 (26%), 24353 (18%) and 21122 (12%). The remaining cells served around 44% of the traffic.
6.4 Time-Based Data (event_ts)
Formatting time-based data (converting the date-time to fractional hours)
telco.event_ts = telco.event_ts.astype('datetime64[ns]')
telco = telco.sort_values(by = ['event_ts'], ascending = True)
time = pd.DatetimeIndex(telco.event_ts) # converts to datetime object index
telco['hours'] = (time.hour + time.minute / 60 + time.second / 3600 )
telcov2 = copy.deepcopy(telco)
telco.head(10)
## imsi_id ... hours
## 61 9f6dbb4957e34ad65b4061e1ac326afa539752db6a1f57... ... 0.000000
## 52 9f6dbb4957e34ad65b4061e1ac326afa539752db6a1f57... ... 0.000000
## 70 9f6dbb4957e34ad65b4061e1ac326afa539752db6a1f57... ... 0.000000
## 149 9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5... ... 0.550000
## 148 9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5... ... 0.550000
## 165 9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5... ... 0.566667
## 166 9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5... ... 0.566667
## 171 e41a1559c092dfa12acf39072821ce575665fdc4c2a46c... ... 0.616667
## 186 d6a5b12c28dc324c1ff61b2540336c0fbe3ac89b1744be... ... 0.800000
## 155 d6a5b12c28dc324c1ff61b2540336c0fbe3ac89b1744be... ... 0.800000
##
## [10 rows x 9 columns]
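As a sanity check on the conversion above, a standalone timestamp maps to fractional hours as follows (a minimal example, independent of the telco data):

```python
import pandas as pd

# One sample timestamp, converted the same way as telco.event_ts above
time = pd.DatetimeIndex(["2020-09-01 14:30:00"])
hours = time.hour + time.minute / 60 + time.second / 3600
print(hours[0])   # 14.5 -> 2:30 PM expressed in fractional hours
```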
output = telco.groupby('imsi_id').agg(
dwell_time = ('hours', 'mean' ),
r_visits = ('hours', 'count')
).sort_values(by = ['dwell_time'], ascending = True).reset_index()
Next, we present the list of our 33 clients according to their dwell_time and the number of recurrent visits for each:
pd.set_option('display.max_rows', 50)
output
## imsi_id dwell_time r_visits
## 0 9dd48ceccaea12ad1ed02afa45f096bba2409f1f5e04d5... 0.558333 4
## 1 d6a5b12c28dc324c1ff61b2540336c0fbe3ac89b1744be... 0.800000 2
## 2 9366f7e222a3682413fc4b3480289bf5ee8bee9f5c9354... 1.316667 1
## 3 f48b2e8df1f9f912c788b57e6f2c3246e08b74440f033e... 1.966667 6
## 4 f70cd60e8a80eebd2422c0db3d8fb5d9b441bb6ef22c71... 10.123333 5
## 5 e41a1559c092dfa12acf39072821ce575665fdc4c2a46c... 10.822917 16
## 6 10ef0c74bc117442d7af58334e1a7f6d7e6e374a705e1f... 11.524242 11
## 7 9f6dbb4957e34ad65b4061e1ac326afa539752db6a1f57... 12.148148 9
## 8 c6f43f7ae49f220a883f6b63b26024eff38ce3c2e773a6... 13.783965 132
## 9 e2ee0f057e15ceedbb10b6c7ba335487fd20cfb7b1ca55... 13.992857 7
## 10 12f577a9e7c23830cfc662a8954126c96a2f966c248457... 14.255556 12
## 11 d993bd44520ea984f45182b7398c98a2af4e0da09ef423... 14.733333 1
## 12 321b7cf31816a63fb196a208c3eeea20de04cd59dc35d8... 15.870000 5
## 13 044aa99b8a013c64e3924cf76dec7c9a25d7315bcaccaf... 15.920833 4
## 14 cbc12ecb1aa2a955a6426a3114f1a3230ff618560be98a... 16.233333 5
## 15 2112b7c11b6fd3e8b5c78af39566e4834685bdcaf56c6b... 16.600000 4
## 16 cfef7cf475779a432e73e64641bc5d2abb58e5eebc15ce... 17.488889 3
## 17 7948cc9c2c4ede47301ef73252a41fdf9b8e07817f5604... 17.503604 37
## 18 2f65da3d1f0ed00d831a2037868732a262523d6231e9bf... 17.633333 2
## 19 16d323326378966732abaafddad4027288389ee110cbd1... 17.683333 3
## 20 f85657aee6b4977e8909ea8de56f26d3d7010a40e5d210... 17.773232 33
## 21 f043c2b9f621827a588b37bf25f7a28f41ce3ce5928dbb... 18.033333 1
## 22 b58217bd1fe520ae4cb6a2228ea1cffa68590965cc5ba8... 18.561538 13
## 23 05b5aa3eb6db1abda56d8cd0f15e6f5ce2307d16cc9e05... 19.327778 3
## 24 9b03a42cf240abbb766ddbe3e5091bfc1cc33c0bf08409... 19.576316 19
## 25 2f627a032a70455a80760c51c0c8cc9282c87f52a25703... 19.905556 18
## 26 2acb2650d37de9efe5ef71d02c0f9ac8dba554f6f70370... 19.916667 12
## 27 16a2a5883d4b3b6f5d4f688e43887bbf712769c612b9da... 19.983333 1
## 28 1581807a0d2e3c44b53791d8601ae83e0e1ac51142d84d... 21.066667 2
## 29 bc9e5df525777dbe77e262794de72923365aa37b3c019f... 21.129902 34
## 30 85d444651f5123ca9e97d30ff30a40f0a65d0487ac37df... 21.250000 4
## 31 450c1b07f639674e3e004251fb86ce273ac21bc1c1754e... 23.471111 15
## 32 059e71d6579dadcfa0cef78ee779b671ac254549a2b4bb... 23.843750 8
Insight 5: Our greatest potential client registered on September 1st, 2020 was:
output[output.r_visits == output.r_visits.max()]
## imsi_id dwell_time r_visits
## 8 c6f43f7ae49f220a883f6b63b26024eff38ce3c2e773a6... 13.783965 132
In addition, we can target this client during business hours, from 1 PM to 2 PM.
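That 1 PM - 2 PM window follows from this client's mean fractional hour of 13.78; a small helper (added here for illustration only) converts fractional hours back to clock time:

```python
def to_clock(fractional_hours):
    """Convert fractional hours (e.g. 13.78) to an HH:MM string."""
    h = int(fractional_hours)
    m = int(round((fractional_hours - h) * 60))
    return f"{h:02d}:{m:02d}"

print(to_clock(13.783965))   # 13:47 -> between 1 PM and 2 PM
```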
Insight 6: Our most recurrent clients registered at our Poseidon CTC Mall POI were:
output.nlargest(3, 'r_visits')
## imsi_id dwell_time r_visits
## 8 c6f43f7ae49f220a883f6b63b26024eff38ce3c2e773a6... 13.783965 132
## 17 7948cc9c2c4ede47301ef73252a41fdf9b8e07817f5604... 17.503604 37
## 29 bc9e5df525777dbe77e262794de72923365aa37b3c019f... 21.129902 34
7. DWELL TIME AVERAGE ANALYSIS
Displaying the distribution of dwell time by the number of recurrent visits:
output_bin = copy.deepcopy(output)
output_bin['points_bin'] = pd.qcut(output_bin['dwell_time'], q=8)
#binning the data
output_bin = output_bin.groupby('points_bin').agg(
r_visits = ('r_visits', 'sum')
).reset_index()
#plotting the data:
plt.figure(figsize = (15, 5))
sns.barplot(x = output_bin.points_bin, y = output_bin.r_visits, color = 'blue')
# Labeling of the plot
plt.xlabel('dwell time (hrs)')
plt.ylabel('Recurrent visits')
plt.title('Distribution of dwell time (Poseidon CTC Mall)')
plt.show()
Insight 6 (a): The busiest hours at the Poseidon CTC Mall on Sept 1st, 2020 occurred between 10 hrs and 13 hrs, with more than 160 recurrent visits. In particular, this is the segment our marketing workforce should focus on.
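For reference, `pd.qcut` cuts on quantiles, so each bin holds roughly the same number of clients rather than spanning equal-width intervals. A toy example:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
bins = pd.qcut(values, q=4)                      # 4 quantile-based bins
counts = bins.value_counts(sort=False).tolist()  # counts in bin order
print(counts)   # [2, 2, 2, 2] -- each bin holds the same number of points
```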
Displaying the relationship between dwell time and the number of recurrent visits:
plt.figure(figsize = (15, 5))
sns.scatterplot(x = output.dwell_time, y = output.r_visits)
# Labeling of the plot
plt.xlabel('dwell time (hrs)')
plt.ylabel('Recurrent visits')
plt.title('Dwell_time vs Recurrent visits (Poseidon CTC Mall)')
plt.show()
output[output.r_visits == output.r_visits.max()]
## imsi_id dwell_time r_visits
## 8 c6f43f7ae49f220a883f6b63b26024eff38ce3c2e773a6... 13.783965 132
Insight 6 (b): We can see a potential outlier, which corresponds to the client noted above. Apart from that, the number of recurrent visits stays below 40. There are interesting points during the very early mornings. Perhaps these are security staff? Perhaps there is an active nightlife around the mall? If there is a cinema located at the mall, these could well be late showings or movie premieres.
ts = copy.deepcopy(telcov2)
ts['cnt'] = 1
#ts = ts.groupby('event_ts').agg({"cnt": "sum"}).reset_index()
ts.event_ts = ts.event_ts.astype(str)
8. Client Trend Analysis
Loading R Packages
library(reticulate)
ts_data <- py$ts
# Forecasting Libraries ----
library(forecast)
library(tidymodels)
library(modeltime)
library(tidyverse)
library(lubridate)
library(timetk)
library(plotly)
#renaming colums
ts_data$Date <- ts_data$event_ts
#converting to datetime
ts_data$Date <- as.POSIXct(as.character(ts_data$Date), format = "%Y-%d-%m %H:%M:%S")
ts_data <- as_tibble(ts_data)
Plotting the data
#summarize by hour
p1 <- ts_data %>% group_by(Date) %>%
summarize_by_time(.date_var=Date, .by = 'hour', no_visits = sum(cnt) ) %>%
plot_time_series(Date, no_visits, .smooth = FALSE,
.title = "Poseidon CTC Malls Visits Trend - Sep 1st 2020", .y_lab = "count of client visits" )
p1
Insight 8: We can clearly spot an increasing trend in the number of visits to our Poseidon CTC Mall. By the end of the day, we had registered 54 unique clients.
Let's understand the patterns of both of our broadband technologies:
#summarize by hour
p2 <- ts_data %>% group_by(broadband) %>%
summarize_by_time(.date_var=Date, .by = 'hour', no_visits = sum(cnt) ) %>%
plot_time_series(Date, no_visits, .smooth = TRUE,
.color_var = broadband ,
.title = "Poseidon CTC Broadband types - Sep 1st 2020",
.y_lab = "count of client visits", .facet_ncol = 2,
.facet_scales = "free" )
p2
Insight 9: 4G broadband technology is not equally consumed at our POI. It would be interesting to know which technologies make up the ‘Other’ category.
Let's understand the patterns in each cell consumed, by LAC code:
#summarize by hour
p3 <- ts_data %>% group_by(LAC) %>%
summarize_by_time(.date_var=Date, .by = 'hour', no_visits = sum(cnt) ) %>%
plot_time_series(Date, no_visits,
.title = "Poseidon CTC Mall by LAC - Sep 1st 2020",
.y_lab = "count of client visits", .facet_ncol = 2,
.facet_scales = "free" )
p3
Insight 10: Local Area Code 11204 is in high demand during several hours of the day. LAC 123353 doesn't show any particular movement except at the end of the day, especially at 7:00 PM.
Let’s see our top (most consumed) CIs:
#summarize by hour
p4 <- ts_data %>% group_by(CI) %>%
summarize_by_time(.date_var=Date, .by = 'hour', no_visits = sum(cnt) ) %>% arrange(desc(no_visits)) %>%
filter(CI %in% c('111', '24122', '131', '41')) %>%
plot_time_series(Date, no_visits, .smooth = FALSE,
.title = "Poseidon CTC Mall by CI - Sep 1st 2020",
.y_lab = "count of client visits", .facet_ncol = 1,
.facet_scales = "free" )
p4
Insight 11: Apparently, not all CIs are highly active during the day. For instance, CI 111 only shows movement during the last hours of the night, and CI 131 shows fluctuations from 12:00 hrs until the end of the day.
ASSESSMENT PART 2
Once the average dwell time analysis is complete, cluster the subscribers seen inside the Poseidon CTC Mall into segments, and explain them. We would also like to know how you selected the number of segments (the K value), if at all.
1. Selecting the data
Storing the data for the columns: dwell_time and r_visits (by each distinct client):
output_2 = output.copy()  # copy so the cluster labels added later do not modify 'output'
X = output_2.iloc[:, [1,2]].values
2. Model Building
# By applying K-Means algorithm, we are going to find the best cluster number
# The idea, is to reduce WCSS (Within - Cluster - Sum of Squares)
from sklearn.cluster import KMeans
wcss = []
# Getting the maximum number of clusters (to begin with, we set 10 by default)
# According to the official documentation of scikit learn, we use:
#‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up
#convergence.
# random_state = 123, just be sure a fixed randomness
# Inertia is the sum of squared distances of all points within a cluster from that
# cluster's centroid
# by default, we chose to select an 'auto' algorithm for converging. Euclidean distance is used in this case,
# but it might change according to our performance.
for i in range(1,11):
kmeans = KMeans(n_clusters= i, max_iter = 300,
init='k-means++', random_state=123,
algorithm='auto')
kmeans.fit(X)
wcss.append(kmeans.inertia_)
## KMeans(n_clusters=1, random_state=123)
## KMeans(n_clusters=2, random_state=123)
## KMeans(n_clusters=3, random_state=123)
## KMeans(n_clusters=4, random_state=123)
## KMeans(n_clusters=5, random_state=123)
## KMeans(n_clusters=6, random_state=123)
## KMeans(n_clusters=7, random_state=123)
## KMeans(random_state=123)
## KMeans(n_clusters=9, random_state=123)
## KMeans(n_clusters=10, random_state=123)
3. Visualizing results
# Viewing ELBOW method to get the optimal value of K
plt.plot(range(1,11), wcss)
plt.title('Elbow Method')
plt.xlabel('No. of clusters')
plt.ylabel('wcss')
plt.show()
From the graph above, according to the “elbow” validation approach, 6 clusters is the point beyond which adding clusters yields only marginal reductions in WCSS, so we choose K = 6.
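Since the elbow read is somewhat subjective, a complementary check (not part of the original analysis) is the silhouette score, which peaks when clusters are well separated. A sketch on synthetic two-column data shaped like our (dwell_time, r_visits) matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(123)
# Synthetic stand-in for X: three well-separated blobs of (x, y) points
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
                    for c in ([0, 0], [5, 5], [10, 0])])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=123,
                    n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(best_k)   # the score peaks at 3 for three blobs
```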
#Model Build
kmeans_model = KMeans(n_clusters= 6, max_iter = 300,
init='k-means++', random_state=123,
verbose = 1, algorithm='auto')
output_2['y_pred'] = kmeans_model.fit_predict(X)
## Initialization complete
## Iteration 0, inertia 476.2443094109775
## Iteration 1, inertia 305.4512442082231
## Converged at iteration 1: strict convergence.
## Initialization complete
## Iteration 0, inertia 545.1293060336468
## Iteration 1, inertia 312.2361716096957
## Iteration 2, inertia 306.7580124449414
## Converged at iteration 2: strict convergence.
## Initialization complete
## Iteration 0, inertia 450.11941478483624
## Iteration 1, inertia 371.25146932287197
## Iteration 2, inertia 315.6318486261001
## Iteration 3, inertia 306.7580124449414
## Converged at iteration 3: strict convergence.
## Initialization complete
## Iteration 0, inertia 437.3813717106833
## Iteration 1, inertia 354.67313867714694
## Iteration 2, inertia 306.7580124449414
## Converged at iteration 2: strict convergence.
## Initialization complete
## Iteration 0, inertia 483.9692655293205
## Iteration 1, inertia 360.8337449127447
## Iteration 2, inertia 310.94913777733245
## Iteration 3, inertia 305.45124420822316
## Converged at iteration 3: strict convergence.
## Initialization complete
## Iteration 0, inertia 544.6388108996249
## Iteration 1, inertia 354.6731386771469
## Iteration 2, inertia 306.7580124449414
## Converged at iteration 2: strict convergence.
## Initialization complete
## Iteration 0, inertia 448.99713376178613
## Iteration 1, inertia 306.7580124449414
## Converged at iteration 1: strict convergence.
## Initialization complete
## Iteration 0, inertia 408.68816905742136
## Iteration 1, inertia 306.7580124449414
## Converged at iteration 1: strict convergence.
## Initialization complete
## Iteration 0, inertia 548.0569280821813
## Iteration 1, inertia 404.6820345169024
## Iteration 2, inertia 383.5558342152465
## Iteration 3, inertia 373.7482806253318
## Converged at iteration 3: strict convergence.
## Initialization complete
## Iteration 0, inertia 520.4707242606468
## Iteration 1, inertia 346.1610231213367
## Iteration 2, inertia 336.1378259792391
## Iteration 3, inertia 311.73039202918505
## Iteration 4, inertia 306.7580124449414
## Converged at iteration 4: strict convergence.
# plot results
plt.figure(figsize=(10,6))
plot_clusters = sns.scatterplot(x='dwell_time', y='r_visits', hue='y_pred',palette='Set1', s=100, alpha=0.2,
data=output_2).set_title('KMeans Clusters (6)', fontsize=15)
plt.show()
Looking at the previous graph, we have segmented our customers into 6 clusters. It's important to note that most of our clients stay below 40 visits a day. One particular cluster captures the highly skewed client with more than 120 visits; from now on, any new client who behaves similarly will be assigned to that cluster.
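Once fitted, the model can assign any new client to a segment from their (dwell_time, r_visits) pair via `predict`. A sketch on toy data (the fitted model and the new client's values here are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical fitted model on toy (dwell_time, r_visits) rows
X_demo = np.array([[1.0, 2], [1.2, 3], [14.0, 130], [17.5, 37], [21.0, 34]])
model = KMeans(n_clusters=2, random_state=123, n_init=10).fit(X_demo)

# A hypothetical new client: mean dwell hour 13.5, 120 recurrent visits
new_client = np.array([[13.5, 120]])
label = model.predict(new_client)[0]

# The new client lands in the same cluster as the heavy-visit point
same_as_outlier = label == model.predict(np.array([[14.0, 130]]))[0]
print(same_as_outlier)   # True
```

Cluster indices are arbitrary between runs, so downstream logic should compare labels rather than rely on a fixed cluster number.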
FINAL WORDS AND RECOMMENDATIONS
(1) By analyzing the dwell-time performance of our Poseidon CTC POI, we were able to understand patterns and relationships in our data that will help the marketing department know and approach our clients better. We were able to segment the most prominent clients in our current portfolio. This will also allow sales leaders to focus on specific segments in which they can target new customers or make offers to current ones.
(2) By understanding our broadband / MNC / LAC usage, we found frequency and cyclical patterns in our clients' behavior. That will help marketing leaders focus on particular times of the day.
(3) We used machine learning to cluster our current clients according to their dwell time and visit counts. When new clients appear, our algorithm can assign them to a cluster. It is highly recommended that we keep updating our models periodically, since changes in client structure and behavior, as well as systematic events, could affect model performance.
(4) The next steps are to put our machine learning model into production or embed the results in a dashboard. Following the time series analysis, it is also recommended to build a forecasting model capable of predicting the number of clients expected at our Poseidon CTC POI.